
    Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

    Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs) and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid in determining which embedded platform is most suitable for a given application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study covers three commonly used hardware accelerators for embedded vision applications: the ARM Cortex-A57 CPU, the Jetson TX2 GPU, and the ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks, and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the others for simple kernels, while for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. We also observe that the FPGA performs increasingly better as a vision application's pipeline complexity grows.
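
    To make the energy/frame metric concrete, here is a minimal Python sketch of how such a figure can be derived from a timed kernel run and an average board power reading; run_kernel, frames, and avg_power_watts are hypothetical stand-ins, not the paper's measurement harness.

```python
import time

def energy_per_frame(run_kernel, frames, avg_power_watts):
    """Estimate energy/frame (joules) for a vision kernel.

    run_kernel and avg_power_watts are placeholders: in a real
    setup, power would come from on-board sensors (e.g. the power
    rails on a TX2 or ZCU102 board), not a fixed constant.
    """
    start = time.perf_counter()
    for frame in frames:
        run_kernel(frame)
    elapsed = time.perf_counter() - start
    return avg_power_watts * elapsed / len(frames)

# A 2x energy/frame reduction means platform A uses half the
# joules per processed frame compared to platform B:
# reduction_ratio = epf_platform_b / epf_platform_a
```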

    A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

    Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96x speedup with 2.97% higher accuracy. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models.
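
    As a loose illustration of the joint precision search that QABAS performs (not the actual framework, which couples the search to a hardware cost model and training), the following toy Python enumeration picks per-layer bit-widths that minimize model size subject to a made-up accuracy proxy.

```python
from itertools import product

# Hypothetical per-layer weight counts for a small basecaller-like
# network; QABAS itself searches jointly over architecture and
# precision with hardware feedback, which this toy enumeration
# does not capture.
layer_weights = [4096, 8192, 8192, 2048]
bit_choices = [4, 8, 16]

def model_bits(assignment):
    # Total weight storage if layer i uses assignment[i] bits.
    return sum(w * b for w, b in zip(layer_weights, assignment))

def accuracy_proxy(assignment):
    # Stand-in for a trained accuracy predictor: penalize low
    # precision more heavily in earlier layers.
    return 100 - sum((16 - b) * 0.1 * (len(assignment) - i)
                     for i, b in enumerate(assignment))

best = min(
    (a for a in product(bit_choices, repeat=len(layer_weights))
     if accuracy_proxy(a) >= 99.0),
    key=model_bits,
)
print(best, model_bits(best))
```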

    Tailor: Altering Skip Connections for Resource-Efficient Inference

    Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware-efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.
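
    The core idea, gradually weakening a trained network's skip connections so they can eventually be dropped, can be sketched roughly as follows in PyTorch; the alpha-annealing schedule and block structure here are illustrative assumptions, not Tailor's actual algorithm.

```python
import torch
import torch.nn as nn

class AnnealedResidualBlock(nn.Module):
    """Residual block whose skip connection is scaled by alpha.

    A rough sketch of the removal idea: start from a fully
    trained network (alpha = 1), then anneal alpha toward 0
    over fine-tuning so the block learns to work without the
    skip path, which can then be deleted from the hardware
    implementation along with its buffers.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.alpha = 1.0  # annealed externally, not learned

    def forward(self, x):
        out = self.bn(self.conv(x))
        return torch.relu(out + self.alpha * x)

# During fine-tuning, e.g. linearly decay the skip contribution:
# for epoch in range(epochs):
#     block.alpha = max(0.0, 1.0 - epoch / anneal_epochs)
#     ... train one epoch ...
```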

    Microscaling Data Formats for Deep Learning

    Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate the practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
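
    A toy sketch of the per-block scaling idea behind MX formats, assuming a shared power-of-two scale and symmetric INT8 elements (real MX formats use an 8-bit E8M0 scale over blocks of, typically, 32 elements, with FP8/FP6/FP4 or INT8 element types):

```python
import numpy as np

def mx_quantize(block, elem_bits=8):
    """Toy per-block scaled-integer quantizer in the spirit of MX.

    One power-of-two scale is shared by the whole block; each
    element is stored as a narrow symmetric integer.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    amax = np.max(np.abs(block))
    # Shared power-of-two scale so amax maps inside [-qmax, qmax].
    exp = int(np.ceil(np.log2(amax / qmax))) if amax > 0 else 0
    scale = 2.0 ** exp
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def mx_dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(32).astype(np.float32)
q, s = mx_quantize(x)
print(np.max(np.abs(x - mx_dequantize(q, s))))  # quantization error
```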

    Performance and Complexity Co-evaluation of the Advanced Video Coding Standard for Cost-Effective Multimedia Communications

    The Advanced Video Coding (AVC) standard, recently defined by the Joint Video Team (JVT) of ITU-T and ISO/IEC, is introduced in this paper together with a co-evaluation of its performance and complexity. While the basic framework is similar to the motion-compensated hybrid scheme of previous video coding standards, additional tools improve the compression efficiency at the expense of an increased implementation cost. As a first step toward bridging the gap between the algorithmic design of a complex multimedia system and its cost-effective realization, a high-level co-evaluation approach is proposed and applied to a real-life AVC design. An exhaustive analysis of the codec's compression efficiency versus complexity (memory and computational costs) design space is carried out at the early algorithmic design phase. If all new coding features are used, the improved AVC compression efficiency (up to 50% compared to current video coding technology) comes with a complexity increase of a factor of 2 for the decoder and of more than one order of magnitude for the encoder. This represents a challenge for resource-constrained multimedia systems such as wireless devices or high-volume consumer electronics. The analysis also highlights an important property of the AVC framework that allows for complexity reduction at the system level: when the new coding features are combined, the implementation complexity accumulates, while the global compression efficiency saturates. Thus, a proper selection of the AVC tools maintains the same performance as the most complex configuration while considerably reducing complexity. The reported results provide inputs to assist the profile definition in the standard, highlight the AVC bottlenecks, and select optimal trade-offs between algorithmic performance and complexity.
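
    The accumulation-versus-saturation property can be illustrated with a toy model (hypothetical tools and numbers, not the paper's measurements): per-tool bit-rate savings combine multiplicatively and therefore saturate, while complexity costs simply add.

```python
# Illustrative only: made-up coding tools, each with a standalone
# bit-rate saving and a relative implementation cost.
tools = {            # (standalone saving, added complexity)
    "CABAC":        (0.12, 1.5),
    "quarter-pel":  (0.15, 2.0),
    "multi-ref":    (0.10, 3.0),
    "deblocking":   (0.08, 1.2),
}

def combined(selected):
    saving = 1.0
    complexity = 1.0
    for s, c in (tools[t] for t in selected):
        saving *= (1.0 - s)   # savings multiply, hence saturate
        complexity += c       # complexity simply accumulates
    return 1.0 - saving, complexity

print(combined(["CABAC", "quarter-pel"]))  # most of the gain...
print(combined(list(tools)))               # ...at far lower cost
```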

    Exploiting the Expressiveness of Cyclo-Static Dataflow to Model Multimedia Implementations

    The design of increasingly complex and concurrent multimedia systems requires a description at a higher abstraction level. Using an appropriate model of computation helps to reason about the system and enables design-time analysis methods. The nature of multimedia processing matches cyclo-static dataflow (CSDF) well in many cases, making it a suitable model. However, channels in an implementation often use, for cost reasons, a kind of shared buffer that cannot be directly described in CSDF. This paper shows how such implementation-specific aspects can be expressed in CSDF without the need for extensions. Consequently, the CSDF graph remains completely analyzable and allows reasoning about its temporal behavior. The obtained relation between model and implementation enables a buffer capacity analysis on the model while assuring the throughput of the final implementation. The capabilities of the approach are demonstrated by analyzing the temporal behavior of an MPEG-4 video encoder with a CSDF graph.
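
    A minimal sketch of the kind of reasoning a CSDF model enables: a simulation that bounds the occupancy of one channel between two actors. The cyclo-static rate sequences below are made up; the paper's point is that shared-buffer implementation details can be folded into such sequences without extending the model.

```python
# Two actors connected by one channel, each with a cyclically
# repeating production/consumption rate sequence (hypothetical).
prod_rates = [2, 1, 3]     # tokens produced per producer firing
cons_rates = [1, 2, 1, 2]  # tokens consumed per consumer firing

def max_occupancy(firings=100):
    tokens, peak = 0, 0
    p = c = 0
    for _ in range(firings):
        tokens += prod_rates[p % len(prod_rates)]
        p += 1
        peak = max(peak, tokens)
        # Fire the consumer as long as it has enough tokens (ASAP).
        while tokens >= cons_rates[c % len(cons_rates)]:
            tokens -= cons_rates[c % len(cons_rates)]
            c += 1
    return peak  # an upper bound on the channel capacity needed

print(max_occupancy())
```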

    Scaling Bandwidth and Complexity of Hybrid MC/DCT Video

    Multi-user video conferencing has high and diverse bandwidth and performance requirements. We analyse the scaling behaviour of a standard hybrid video codec by varying the quantization degree and the temporal and spatial resolution. We define a framework to select the set of coding parameters when the bandwidth and complexity constraints are known. Additionally, we compare the efficiency of simulcast and layered solutions in overcoming large bandwidth and/or computational resource variations among the participants. In this context, the temporal resolution turns out to be the parameter best suited to providing multiple resolutions through independent streams that cope with network unreliability.
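
    The parameter-selection step can be sketched as a small exhaustive search over operating points; the rate, complexity, and quality models below are toy stand-ins for the measured scaling behaviour, not the paper's framework.

```python
from itertools import product

# Hypothetical operating points: quantization step, temporal
# resolution (fps) and spatial resolution.
quant_steps = [8, 16, 32]
fps_opts = [10, 15, 30]
resolutions = [(176, 144), (352, 288)]

def select(bandwidth_kbps, complexity_budget):
    best, best_quality = None, -1.0
    for q, fps, (w, h) in product(quant_steps, fps_opts, resolutions):
        bitrate = w * h * fps * 0.01 / q   # toy rate model
        complexity = w * h * fps / 1e6     # toy cost model
        quality = fps * w * h / (q * 1e4)  # toy quality proxy
        if bitrate <= bandwidth_kbps and complexity <= complexity_budget:
            if quality > best_quality:
                best, best_quality = (q, fps, (w, h)), quality
    return best

# Pick the best feasible point for a 384 kbps link and a fixed
# (made-up) complexity budget:
print(select(bandwidth_kbps=384, complexity_budget=2.0))
```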